Performing virtual non-verbal communication cues within a virtual environment of a video conference

ABSTRACT

A method for performing virtual non-verbal communication cues within a virtual environment of a video conference, the method may include: (i) sensing that a certain participant of a video conference performs non-verbal communication cues in relation to virtual items that appear in a version of a virtual video conference environment that is displayed on a display of the certain participant; (ii) determining, by a machine learning process, virtual non-verbal communication cues to be executed by a certain avatar that represents the certain participant, wherein the virtual non-verbal communication cues mimic the non-verbal communication cues performed by the certain participant; and (iii) assisting to perform the virtual non-verbal communication cues of the certain avatar within the virtual video conference environment.

CROSS REFERENCE

This application claims priority from U.S. patent application Ser. No. 17/249,468 filing date Mar. 2, 2021, which claims priority from U.S. provisional patent Ser. No. 63/023,836 filing date May 12, 2020, from U.S. provisional patent Ser. No. 63/081,860 filing date Sep. 22, 2020, and from U.S. provisional patent Ser. No. 63/199,014 filing date Dec. 1, 2020, all being incorporated herein in their entirety.

This application claims priority from U.S. patent application Ser. No. 17/304,378 filing date Jun. 20, 2021, and from U.S. patent application Ser. No. 17/539,036 filing date Nov. 30, 2021, all being incorporated herein in their entirety.

BACKGROUND

Video conference calls are very popular. They require that each participant has their own computerized system with a camera that is usually located close to a display.

Typically, several participants in a meeting are presented in separate small tiles and another tile may be used for sharing the screen of one of the participants.

Each participant is typically shown with the background of their own office or with a virtual background of their selection.

There is a growing need to enhance the virtual interaction between participants and to overcome various other problems associated with current video conference call services.

SUMMARY

There may be provided a system, method and computer readable medium for performing virtual non-verbal communication cues within a virtual environment of a video conference.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a method;

FIG. 2 illustrates an example of a computerized environment;

FIG. 3 illustrates an example of a computerized environment;

FIG. 4 illustrates an example of data structures;

FIG. 5 illustrates an example of a process for amending a direction of view of a 3D model of a part of a participant according to a direction of gaze of the participant;

FIG. 6 includes examples of methods;

FIG. 7 is an example of a panoramic view of a virtual 3D environment populated by five participants, a partial view of some of the participants within the virtual 3D environment, and a hybrid view;

FIG. 8 illustrates two cameras, fields of view of the two cameras, avatars and a displayed virtual 3D video conference environment (V3DVCE);

FIG. 9 illustrates two cameras, fields of view of the two cameras, avatars and a displayed V3DVCE;

FIG. 10 illustrates two cameras, fields of view of the two cameras, avatars and a displayed V3DVCE;

FIG. 11 illustrates two cameras, fields of view of the two cameras, avatars and a displayed V3DVCE;

FIG. 12 illustrates a participant and an avatar;

FIG. 13 illustrates points of gaze of a participant and an avatar;

FIG. 14 illustrates an example of a method;

FIGS. 15A and 15B illustrate participants and their avatars;

FIG. 15C illustrates an example of a method;

FIG. 15D illustrates an example of a method;

FIG. 16A illustrates an example of a method;

FIG. 16B illustrates an example of a method;

FIG. 16C illustrates an example of a method;

FIG. 17A illustrates a picture of hair head;

FIG. 17B shows a virtual image;

FIG. 17C illustrates an example of a method;

FIG. 17D illustrates an example of a method;

FIG. 17E illustrates an example of a method;

FIG. 18 illustrates an example of a method; and

FIG. 19A illustrates an example of an image of a participant making a “thumbs up” gesture and the virtual representation of the “thumbs up” gesture as made by an avatar (not shown) of the participant.

DESCRIPTION OF THE DRAWINGS

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure.

However, it will be understood by those skilled in the art that the present embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present embodiments of the disclosure.

The subject matter regarded as the embodiments of the disclosure is particularly pointed out and distinctly claimed in the concluding portion of the specification. The embodiments of the disclosure, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

Because the illustrated embodiments of the disclosure may, for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained to any greater extent than that considered necessary, as illustrated above, for the understanding and appreciation of the underlying concepts of the present embodiments of the disclosure and in order not to obfuscate or distract from the teachings of the present embodiments of the disclosure.

Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a computer readable medium that is non-transitory and stores instructions for executing the method.

Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a computer readable medium that is non-transitory and stores instructions executable by the system.

Any reference in the specification to a computer readable medium that is non-transitory should be applied mutatis mutandis to a method that may be applied when executing instructions stored in the computer readable medium and should be applied mutatis mutandis to a system configured to execute the instructions stored in the computer readable medium.

The term “and/or” means additionally or alternatively.

Any reference to a “user” should be applied mutatis mutandis to the term “participant”—and vice versa.

There is provided a method, a non-transitory computer readable medium and a system related to video that may, for example, be applicable to 3D video conference calls. At least some of the examples and/or embodiments illustrated in the applications may be applied mutatis mutandis for other purposes and/or during other applications.

For example, consider a 3D video conference that involves multiple participants. A first participant is imaged, and a second participant wishes to view a first avatar (or any other 3D visual representation) of the first participant within a virtual 3D video conference environment.

The generation of the first avatar (or any other 3D visual representation) may be executed in various manners—for example—only by a device of the second participant, only by the device of the first participant, partially by the device of the second participant, partially by the device of the first participant, by a cooperation between the devices of the first and second participants, by another computerized system (such as but not limited to a cloud system or a remote system), and/or any combination of one or more devices.

The inclusion of the avatar (or any other 3D visual representation) within the virtual 3D video conference environment may be executed in various manners—for example—only by a device of the second participant, only by the device of the first participant, partially by the device of the second participant, partially by the device of the first participant, by a cooperation between the devices of the first and second participants, by another device (such as but not limited to a cloud device or a remote device), and/or any combination of one or more devices.

Any reference to one manner of execution of any step of the generation of the first avatar and/or any reference to one manner of execution of any step of the inclusion of the avatar within the virtual 3D video conference environment may be applied mutatis mutandis to any other manner of execution.

The generation of the first avatar and/or the inclusion of the first avatar may be responsive to information gained by the device of the first user or by a camera or sensor associated with the device of the first user. A non-limiting example of information may include information regarding the first participant and/or information regarding the acquisition of images of the first participant (for example camera settings, illumination and/or ambient conditions).

The system may include multiple user devices and/or intermediate devices such as servers, cloud computers, and the like.

FIG. 1 illustrates an example of method 200.

Method 200 is for conducting a three-dimensional video conference between multiple participants.

Method 200 may include steps 210, 220 and 230.

Step 210 may include receiving direction of gaze information regarding a direction of gaze of each participant within a representation of a virtual 3D video conference environment that is associated with the participant.

The representation of a virtual 3D video conference environment that is associated with the participant is a representation that is shown to the participant. Different participants may be associated with different representations of a virtual 3D video conference environment.

The direction of gaze information may represent a detected direction of gaze of the participant.

The direction of gaze information may represent an estimated direction of gaze of the participant.

Step 220 may include determining, for each participant, updated 3D participant representation information within the virtual 3D video conference environment, that reflects the direction of gaze of the participant. Step 220 may include estimating how the virtual 3D video conference environment will be seen from the direction of gaze of the participant.

Step 230 may include generating, for at least one participant, an updated representation of the virtual 3D video conference environment, wherein the updated representation of the virtual 3D video conference environment represents the updated 3D participant representation information for at least some of the multiple participants. Step 230 may include rendering images of the virtual 3D video conference environment for at least some of the multiple participants. Alternatively, step 230 may include generating input information (such as a 3D model and/or one or more texture maps) to be fed to a rendering process.

Method 200 may also include step 240 of displaying, by a device of a participant of the multiple participants, an updated representation of the virtual 3D video conference environment, the updated representation may be associated with the participant.

Method 200 may include step 250 of transmitting the updated representation of virtual 3D video conference environment to at least one device of at least one participant.

The multiple participants may be associated with multiple participant devices, wherein the receiving and determining may be executed by at least some of the multiple participant devices. Any step of method 200 may be executed by at least some of the multiple participant devices or by another computerized system.

The multiple participants may be associated with multiple participant devices, wherein the receiving and determining may be executed by a computerized system that differs from any of the multiple participant devices.

Method 200 may include one or more additional steps, collectively denoted 290.

The one or more additional steps may include at least one out of:

a. Determining a field of view of a third participant within the virtual 3D video conference environment.
b. Setting a third updated representation of the virtual 3D video conference environment that may be sent to a third participant device to reflect the field of view of the third participant.
c. Receiving initial 3D participant representation information for generating the 3D representation of the participant under different circumstances. The different circumstances may include at least one out of (a) different image acquisition conditions (different illumination and/or collection conditions), (b) different directions of gaze, (c) different expressions, and the like.
d. Receiving, in run time, circumstances metadata; and amending, in real time, the updated 3D participant representation information based on the circumstances metadata.
e. Repetitively selecting, for each participant, a selected 3D model out of multiple 3D models of the participant.
f. Repetitively smoothing a transition from one selected 3D model of the participant to another 3D model of the participant.
g. Selecting an output of at least one neural network of the multiple neural networks based on a required resolution.
h. Receiving or generating participant appearance information about head poses and expressions of the participants.
i. Determining the updated 3D participant representation information to reflect the participant appearance information.
j. Determining a shape of each of the avatars that represent the participants.
k. Determining relevancy of segments of updated 3D participant representation information.
l. Selecting which segments to transmit, based on the relevancy and available resources.
m. Generating a 3D model and one or more texture maps of 3D participant representation information of a participant.
n. Estimating 3D participant representation information of one or more hidden areas of a face of a participant.
o. Estimating 3D model hidden areas and one or more hidden parts texture maps.
p. Determining a size of the avatar.
q. Receiving audio information regarding audio from the participants and appearance information.
r. Synchronizing between the audio and the 3D participant representation information.
s. Estimating face expressions of the participants based on audio from the participants.
t. Estimating movements of the participants.
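As a hedged illustration of items k and l in the list above, the selection of segments by relevancy and available resources could be sketched as a greedy choice under a byte budget. The `Segment` fields, the relevancy scores and the greedy policy are assumptions made for this sketch only; the specification does not prescribe a particular selection rule.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str         # hypothetical label, e.g. "face" or "hair"
    relevancy: float  # higher means more important to transmit
    size_bytes: int   # cost of transmitting this segment


def select_segments(segments, budget_bytes):
    """Greedily pick the most relevant segments that still fit in the budget."""
    chosen = []
    remaining = budget_bytes
    for seg in sorted(segments, key=lambda s: s.relevancy, reverse=True):
        if seg.size_bytes <= remaining:
            chosen.append(seg)
            remaining -= seg.size_bytes
    return chosen
```

With a budget of 1000 bytes and segments of 600 (face), 300 (hands) and 500 (hair) bytes in decreasing relevancy, this sketch would transmit the face and hands segments and skip the hair segment.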

The receiving of the 3D participant representation information may be done during an initialization step.

The initial 3D participant representation information may include an initial 3D model and one or more initial texture maps.

The 3D participant representation information may include a 3D model and one or more texture maps.

The 3D model may have separate parameters for shape, pose and expression.

Each of the one or more texture maps may be selected and/or augmented based on at least one out of shape, pose and expression.

Each of the one or more texture maps may be selected and/or augmented based on at least one out of shape, pose, expression and angular relationship between a face of the participant and an optical axis of a camera that captures an image of face of the participant.

The determining, for each participant, of the updated 3D participant representation information may include at least one of the following:

a. Using one or more neural networks for determining the updated 3D participant representation information.
b. Using multiple neural networks for determining the updated 3D participant representation information, wherein different neural networks of the multiple neural networks may be associated with different circumstances.
c. Using multiple neural networks for determining the updated 3D participant representation information, wherein different neural networks of the multiple neural networks may be associated with different resolutions.

The updated representation of the virtual 3D video conference environment may include an avatar per participant of the at least some of the multiple participants.

A direction of gaze of an avatar within the virtual 3D video conference environment may represent a spatial relationship between (a) a direction of gaze of a participant that may be represented by the avatar and (b) a representation of the virtual 3D video conference environment displayed to the participant.

The direction of gaze of an avatar within the virtual 3D video conference environment may be agnostic to an optical axis of a camera that captured a head of the participant.

An avatar of a participant within the updated representation of the virtual 3D video conference environment may appear in the updated representation of the virtual 3D video conference environment as being captured by a virtual camera located on a virtual plane that crosses the eyes of the participant. Accordingly, the virtual camera and the eyes may be located, for example, at the same height.

The updated 3D participant representation information may be compressed.

The updated representation of the virtual 3D video conference environment may be compressed.

The generating of the 3D model and one or more texture maps may be based on images of the participant that were acquired under different circumstances.

The different circumstances may include different viewing directions of a camera that acquired the images, different poses, and different expressions of the participant.

The estimating of the 3D participant representation information of one or more hidden areas may be executed by using one or more generative adversarial networks.

The determining, for each participant, of the updated 3D participant representation information may include at least one out of:

a. Applying a super-resolution technique.
b. Applying noise removal.
c. Changing an illumination condition.
d. Adding or changing wearable item information.
e. Adding or changing makeup information.

The updated 3D participant representation information may be encrypted.

The updated representation of virtual 3D video conference environment may be encrypted.

The appearance information may be about head poses and expressions of the participants and/or be about lip movements of the participants.

The estimating face expressions of the participants based on audio from the participants may be executed by a neural network trained to map audio parameters to face expression parameters.

FIG. 2 illustrates an example of a computational environment that includes user devices 4000(1)-4000(R) of users 4010(1)-4010(R). Index r ranges between 1 and R, R being a positive integer. The r′th user device 4000(r) may be any computerized device that may include one or more processing circuits 4001(r), a memory 4002(r), a man machine interface such as a display 4003(r), and one or more sensors such as camera 4004(r). The r′th user 4010(r) is associated with (uses) the r′th user device 4000(r). The camera may belong to the man machine interface.

The user devices 4000(1)-4000(R) and a remote computerized system 4100 may communicate over one or more networks such as network 4050. The one or more networks may be any type of network, such as the Internet, a wired network, a wireless network, a local area network, a global network, and the like.

The remote computerized system may include one or more processing circuits 4101(1), a memory 4101(2), and may include any other component.

Any one of the user devices 4000(1)-4000(R) and the remote computerized system 4100 may participate in the execution of any method illustrated in the specification. Participate means executing at least one step of any of said methods.

Any processing circuit may be used—one or more network processors, non-neural network processors, rendering engines, image processors and the like.

One or more neural networks may be located at a user device, at multiple user devices, at a computerized system outside any of the user devices, and the like.

FIG. 3 illustrates an example of a computational environment that includes user devices 4000(1)-4000(R) of users 4010(1)-4010(R). Index r ranges between 1 and R, R being a positive integer. The r′th user device 4000(r) may be any computerized device that may include one or more processing circuits 4001(r), a memory 4002(r), a man machine interface such as a display 4003(r), and one or more sensors such as camera 4004(r). The r′th user 4010(r) is associated with (uses) the r′th user device 4000(r).

The user devices 4000(1)-4000(R) may communicate over one or more networks such as network 4050.

Any one of the user devices 4000(1)-4000(R) may participate in the execution of any method illustrated in the specification. Participate means executing at least one step of any of said methods.

FIG. 4 illustrates an example of various data structures. The data structures may include user avatars 4101(1)-4101(j), texture maps 4102(1)-4102(k), 3D models 4103(1)-4103(m), 3D representations of objects 4104(1)-4104(n), and any mapping or other data structures mentioned in the application.

Any user may be associated with one or more data structure of any type—avatar, 3D model, texture map, and the like.

Some of the examples refer to a virtual 3D video conference environment such as a meeting room, restaurant, cafe, concert, party, external or imaginary environment in which the users are set. Each participant may choose or be otherwise associated with a virtual or actual background and/or may select or otherwise receive any virtual or actual background in which avatars related to at least some of the participants are displayed. The virtual 3D video conference environment may include one or more avatars that represents one or more of the participants. The one or more avatars may be virtually located within the virtual 3D video conference environment. One or more features of the virtual 3D video conference environment (that may or may not be related to the avatars) may differ from one participant to another.

The full body, the upper part of the body, or just the face of each user may be seen in this environment. Thus an avatar may include the full body of a participant, the upper part of the body of the participant, or just the face of the participant.

Within the virtual 3D video conference environment there may be provided an improved visual interaction between users that may emulate the visual interaction that exists between actual users that are actually positioned near each other. This may include creating or ceasing to have eye-contact, expressions directed at specific users and the like.

In a video conference call between different users, each user may be provided with a view of one or more other users. The system may determine, based on gaze direction and the virtual environment, where the user looks (for example at one of the other users, at none of the users, at a screen showing a presentation, at a whiteboard, etc.). This is reflected by the virtual representation (3D model) of the user within the virtual environment, so that other users may determine where the user is looking.

FIG. 5 illustrates an example of a process for amending a direction of view of an avatar of a part of a participant according to a direction of gaze of the participant. The upper part of FIG. 5 is a virtual 3D video conference environment—represented by a panoramic view 41 of five participants 51, 52, 53, 54 and 55 sitting near table 60. All participants face the same direction—the screen.

In the lower image, the avatar of the fifth participant faces the avatar of the first participant, as the fifth participant was detected to look at the 3D model of the first participant within the environment as presented to the fifth participant.

Tracking the user's eyes and gaze direction may also be used to determine the direction in which the user is looking (direction of gaze) and at which person or object the user is looking. This information can be used to rotate the avatar's head and eyes so that in the virtual space it also appears as if the user is looking at the same person or object as in the real world.
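A minimal sketch of how the rotation of the avatar's head toward the looked-at person or object might be computed is given below. It assumes that positions in the virtual space are available as (x, y, z) tuples, with x pointing right, y pointing up and z pointing forward; the function name and the axis convention are hypothetical and are not taken from the specification.

```python
import math


def look_at_angles(avatar_pos, target_pos):
    """Return (yaw, pitch) in degrees that turn a head at avatar_pos toward target_pos.

    Assumed convention: x is right, y is up, z is forward; yaw 0 looks along +z.
    """
    dx = target_pos[0] - avatar_pos[0]
    dy = target_pos[1] - avatar_pos[1]
    dz = target_pos[2] - avatar_pos[2]
    # Yaw is the horizontal turn; pitch is the elevation above the horizontal plane.
    yaw = math.degrees(math.atan2(dx, dz))
    pitch = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return yaw, pitch
```

For example, a target located one unit to the right and one unit ahead of the avatar yields a yaw of 45 degrees and a pitch of 0 degrees under these assumptions.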

Tracking the user's head pose and eye gaze may also be used to control the virtual world's appearance on the user's screen. For example, if the user looks at the right side of the screen, the point of view of the virtual camera may move to the right, so that the person or object at which the user is looking is located at the center of the user's screen.

The rendering of a user's head, body, and hands from a certain point of view that is different than the original point of view of the camera may be done in different ways, as described below:

In one embodiment, a 3D model and texture maps are created before the beginning of the meeting and this model is then animated and rendered at run time according to the user's pose and expressions that are estimated from the video images.

A texture map is a 2D image in which each color pixel represents the red, green and blue reflectance coefficients of a certain area in the 3D model. An example of a texture map is shown in FIG. 20. Each color pixel in the texture map corresponds to certain coordinates within a specific polygon (e.g., triangle) on the surface of the 3D model.

Generally, each pixel in the texture map has an index of the triangle to which it is mapped and 3 coordinates defining its exact location within the triangle.

A 3D model composed of a fixed number of triangles and vertices may be deformed as the 3D model changes. For example, a 3D model of a face may be deformed as the face changes its expression. Nevertheless, the pixels in the texture map correspond to the same locations in the same triangles, even though the 3D locations of the triangles change as the expression of the face changes.

Texture maps may be constant or may vary as a function of time, expression or of viewing angle. In any case, the correspondence of a given pixel in a texture map and a certain coordinate in a certain triangle in the 3D model doesn't change.
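The fixed correspondence between a texture pixel and a location within a triangle may be illustrated by the following sketch. It assumes that the three coordinates stored per pixel are barycentric weights that sum to 1, which is one common reading of the description above; the function name and the data layout are hypothetical.

```python
def surface_point(vertices, triangles, tri_index, bary):
    """3D position of a texture pixel on the (possibly deformed) mesh.

    vertices:  list of (x, y, z) tuples
    triangles: list of (i, j, k) vertex-index triples
    bary:      three barycentric weights that sum to 1
    """
    corners = [vertices[i] for i in triangles[tri_index]]
    # Weighted sum of the triangle corners, computed separately per axis.
    return tuple(
        sum(w * corner[axis] for w, corner in zip(bary, corners))
        for axis in range(3)
    )
```

When the mesh deforms, only `vertices` changes; the triangle index and the weights of the pixel stay the same, so the pixel follows its triangle.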

In yet another embodiment, a new view is created based on a real-time image obtained from a video camera and the position of the new point of view (virtual camera).

In order to best match between the audio and the lip movement and facial expressions, the audio and video that is created from the rendering of the 3D models based on the pose and expressions parameters are synchronized. The synchronization may be done by packaging the 3D model parameters and the audio in one packet corresponding to the same time frame or by adding time stamps to each of the data sources.
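The first synchronization option, packaging the 3D model parameters and the audio of the same time frame in one packet, might look like the sketch below. The `FramePacket` layout, the field names and the assumption that the audio sample rate is an integer multiple of the frame rate are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FramePacket:
    timestamp_ms: int          # capture time shared by both payloads
    model_params: List[float]  # pose and expression parameters of the frame
    audio_samples: List[int]   # audio captured during the same frame


def package_frames(param_frames, audio, sample_rate_hz, fps):
    """Bundle each frame's parameters with the audio of the same time window."""
    samples_per_frame = sample_rate_hz // fps
    packets = []
    for n, params in enumerate(param_frames):
        start = n * samples_per_frame
        packets.append(FramePacket(
            timestamp_ms=n * 1000 // fps,
            model_params=params,
            audio_samples=audio[start:start + samples_per_frame],
        ))
    return packets
```

Because each packet carries its own time stamp, a receiver could also use the same structure for the second option, in which separate data sources are aligned by time stamps.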

To further improve the natural appearance of the rendered model, a neural network may be trained to estimate the facial expression coefficients based on the audio. This can be done by training the neural network using a database of videos of people talking and the corresponding audio of this speech. The videos may be of the participant that should be represented by an avatar or of other people. Given enough examples, the network learns the correspondence between the audio (i.e. phonemes) and the corresponding face movements, especially the lip movements. Such a trained network would enable continuous rendering of the facial expressions, and specifically the lip movements, even when the video quality is low or when part of the face is hidden from the original video camera.

In yet another embodiment, a neural network can be trained to estimate the audio sound from the lip and throat movements or from any other facial cues, as is done by professional lip readers. This would make it possible to create or improve the audio when the audio is broken or when there are background noises that reduce its quality.

In yet another embodiment a neural network is trained to compress audio by finding a latent vector of parameters from which the audio can be reconstructed at a high quality. Such a network could serve to compress audio at a lower bit rate than possible with standard audio compression methods for a given audio quality or obtain a higher audio quality for a given bit rate.

Such a network may be trained to compress the audio signal to a fixed number of coefficients, subject to the speech being as similar as possible to the original speech under a certain cost function.

The transformation of the speech to a set of parameters may be a nonlinear function and not just a linear transformation as is common in standard speech compression algorithms. One example would be that the network would need to learn and define a set of basis vectors which form a spanning set of spoken audio.

The parameters then would be the vectorial coefficients of the audio as spanned by this set.
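The idea of representing audio by coefficients over a set of basis vectors can be illustrated with a simplified, linear sketch. Note the substitution: the specification describes a learned and possibly nonlinear transformation, whereas the code below uses a fixed orthonormal cosine (DCT-II) basis purely as a stand-in, and all function names are hypothetical.

```python
import math


def cosine_basis(n, k):
    """First k vectors of an orthonormal DCT-II basis of length n."""
    basis = []
    for i in range(k):
        scale = math.sqrt((1 if i == 0 else 2) / n)
        basis.append([scale * math.cos(math.pi * (t + 0.5) * i / n) for t in range(n)])
    return basis


def encode(frame, basis):
    """Coefficients of the frame with respect to each basis vector."""
    return [sum(x * b for x, b in zip(frame, vec)) for vec in basis]


def decode(coeffs, basis):
    """Reconstruct the frame as a weighted sum of the basis vectors."""
    n = len(basis[0])
    return [sum(c * vec[t] for c, vec in zip(coeffs, basis)) for t in range(n)]
```

Keeping fewer basis vectors than samples (k smaller than n) yields a fixed number of coefficients per frame at the cost of an approximate reconstruction; a trained network would instead learn a set suited to speech.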

FIG. 6 illustrates method 2001.

Method 2001 is for conducting a 3D video conference between multiple participants. The method may include steps 2011 and 2021.

Step 2011 may include determining, for each participant, updated 3D participant representation information within the virtual 3D video conference environment, that represents the participant. The determining may be based on audio generated by the participants and appearance information about appearance of the participants.

Step 2021 may include generating, for at least one participant, an updated representation of the virtual 3D video conference environment, wherein the updated representation of the virtual 3D video conference environment represents the updated 3D participant representation information for at least some of the multiple participants. For example, any movement by the participant may expose or occlude parts of the environment. Additionally, movements by the participant may affect lighting in the room, as the movements may modify the exposure to light of different parts of the environment.

The method may include matching between the audio from a certain participant and appearance information of a certain participant.

The appearance information may be about head poses and expressions of the participants.

The appearance information may be about lip movements of the participants.

Communications System Based on the 3D Models.

During the communication session, i.e., a 3D video conference call between several users, a 2D or 3D camera (or several cameras) grabs videos of the users. From these videos a 3D model (for example—the best fitting 3D model) of the user may be created at a high frequency, e.g., at a frame rate of 15 to 120 fps.

Temporal filters or temporal constraints in the neural network may be used to assure a smooth transition between the parameters of the model corresponding to the video frames in order to create a smooth temporal reconstruction and avoid jerkiness of the result.
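One simple temporal filter that could serve this purpose is an exponential moving average over the per-frame model parameters. The sketch below is an assumption about how such a filter might be applied; the specification does not name a specific filter, and `alpha` is a hypothetical tuning parameter.

```python
def smooth_parameters(frames, alpha=0.5):
    """Exponentially smooth a sequence of per-frame parameter vectors.

    alpha close to 1 follows the input closely; alpha close to 0 smooths more.
    """
    smoothed = []
    state = None
    for params in frames:
        if state is None:
            # The first frame initializes the filter state.
            state = list(params)
        else:
            state = [alpha * p + (1 - alpha) * s for p, s in zip(params, state)]
        smoothed.append(list(state))
    return smoothed
```

A step change in a parameter is then spread over several frames, which reduces jerkiness at the cost of a short lag.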

The real-time parametric model together with the reflectance map and other maps may be used to render a visual representation of the face and body that may be very close to the original image of the face and body in the video.

Since this may be a parametric model, it may be represented by a small number of parameters. Typically, fewer than 300 parameters may be used to create a high-quality model of the face including each person's shape, expression and pose.

These parameters may be further compressed using quantization and entropy coding such as a Huffman or arithmetic coder.

The parameters may be ordered according to their importance and the number of parameters that may be transmitted and the number of bits per parameter may vary according to the available bandwidth.

In addition, instead of coding the parameters' values, the differences of these values between consecutive video frames may be coded.
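The quantization and frame-to-frame difference coding described above can be sketched as follows. This is a minimal sketch under assumptions: a uniform quantization step is used, the function names are hypothetical, and the entropy coding stage (Huffman or arithmetic) is left out.

```python
from typing import List, Sequence


def quantize(params: Sequence[float], step: float) -> List[int]:
    """Map each parameter to the index of its nearest quantization level."""
    return [round(p / step) for p in params]


def dequantize(levels: Sequence[int], step: float) -> List[float]:
    """Recover approximate parameter values from quantization indices."""
    return [level * step for level in levels]


def delta_encode(current: Sequence[int], previous: Sequence[int]) -> List[int]:
    """Code the differences between consecutive frames instead of the values."""
    return [c - p for c, p in zip(current, previous)]


def delta_decode(deltas: Sequence[int], previous: Sequence[int]) -> List[int]:
    """Rebuild the current frame from the previous frame and the differences."""
    return [d + p for d, p in zip(deltas, previous)]
```

When parameters change slowly between frames, most differences are small integers near zero, which is what makes a subsequent entropy coder effective.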

The model's parameters may be transmitted to all other user devices directly or to a central server. This may save a lot of bandwidth because, instead of sending the entire model of the actual high-quality image during the entire conference call, far fewer bits representing the parameters may be transmitted. This may also guarantee a high quality of the video conference call, even when the current available bandwidth may be low.

Transmitting the model parameters directly to the other users instead of via a central server may reduce the latency by about 50%.

The other user devices may reconstruct the appearance of the other users from the 3D model parameters and the corresponding reflectance maps. Since the reflectance maps, representing such things as a person's skin color, change very slowly, they may be transmitted only once at the beginning of the session or at a low updating frequency according to changes that occur in these reflectance maps.

In addition, the reflectance maps and other maps may be updated only partially, e.g., according to the areas that have changed or according to semantic maps representing body parts. For example, the face may be updated but the hair or body that may be less important for reconstructing emotions may not be updated or may be updated at a lower frequency.

In some cases, the bandwidth available for transmission may be limited. Under such conditions, it may be useful to order the parameters to transmit according to some prioritization and then transmit the parameters in this order as the available bandwidth allows. This ordering may be done according to their contribution to the visual perception of a realistic video. For example, parameters related to the eyes and lips may have higher perceptual importance than those related to cheeks or hair. This approach would allow for a graceful degradation of the reconstructed video.
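The prioritized transmission described above might look like the following sketch. The `Param` record, its `priority` field (lower means more perceptually important) and the bit costs are hypothetical; the disclosure does not specify how importance is scored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Param:
    name: str
    priority: int  # assumed convention: lower value = higher perceptual importance
    bits: int      # assumed cost of sending this parameter in the current frame


def select_for_budget(params: List[Param], budget_bits: int) -> List[Param]:
    """Send parameters in priority order until the bit budget is exhausted."""
    chosen: List[Param] = []
    remaining = budget_bits
    for param in sorted(params, key=lambda p: p.priority):
        if param.bits > remaining:
            # Stop at the first parameter that does not fit, so the
            # transmitted set is always a prefix of the priority order.
            break
        chosen.append(param)
        remaining -= param.bits
    return chosen
```

Because the selected set is always a prefix of the priority order, a shrinking budget drops the least important parameters first, which is one way to obtain the graceful degradation mentioned above.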

The model parameters, the video pixels that are not modelled, and the audio may all be synchronized.

As a result, the total bandwidth consumed by the transmission of the 3D model parameters may be several hundred bits per second and much lower than the 100 kbps-3 Mbps that may be typically used for video compression.

A parametric model of the user's speech may also be used to compress the user's speech beyond what may be possible with a generic speech compression method. This would further reduce the bandwidth required for video and audio conferencing. For example, a neural network may be used to compress the speech into a limited set of parameters from which the speech can be reconstructed. The neural network is trained so that the resulting decompressed speech is closest to the original speech under a specific cost function. The neural network may be a nonlinear function, unlike linear transformations used in common speech compression algorithms.

The transmission of bits for reconstructing the video and audio at the receiving end may be prioritized so that the most important bits may be transmitted or receive a higher quality of service. This may include but may not be limited to prioritizing audio over video, prioritizing of the model parameters over texture maps, prioritizing certain areas of the body or face over others, such as prioritizing information relevant to the lips and eyes of the user.

An optimization method may determine the allocation of bitrate or quality of service to audio, 3D model parameters, texture maps or pixels or coefficients that may be not part of the model in order to ensure an overall optimal experience. For example, as the bitrate is reduced, the optimization algorithm may decide to reduce the resolution or update frequency of the 3D model and ensure a minimal quality of the audio signal.
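A very simple stand-in for such an allocation is a fixed-order greedy rule: reserve a minimum audio rate first, then the model parameters, and give whatever is left to texture maps and unmodelled pixels. This is not an optimization method in the sense of the paragraph above, and the default rates below are placeholders chosen for illustration.

```python
from typing import Dict


def allocate_bitrate(
    total_bps: int,
    min_audio_bps: int = 16_000,  # placeholder value, not from the disclosure
    model_bps: int = 2_000,       # placeholder value, not from the disclosure
) -> Dict[str, int]:
    """Greedy allocation: audio first, then model parameters, then textures."""
    audio = min(total_bps, min_audio_bps)
    remaining = total_bps - audio
    model = min(remaining, model_bps)
    remaining -= model
    return {"audio": audio, "model": model, "texture": remaining}
```

As the total rate falls, the texture share shrinks first and the model share next, while the audio keeps its minimum for as long as possible.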

The users may be provided with one or more views of the virtual 3D video conference environment. The user may or may not select the field of view, for example a field of view that includes all of the other users or only one or some of the users, and/or may select or view one or some objects of the virtual 3D video conference environment such as TV screens, whiteboards, etc.

When combining the video pixels and the rendered 3D models, the areas corresponding to the model, the areas corresponding to the video pixels, or both may be processed so that the combination may appear natural and a seam between the different areas would not be apparent. This may include but may not be limited to relighting, blurring, sharpening, denoising or adding noise to one or some of the image components so that the whole image appears to originate from one source.

Each user may use a curved screen or a combination of physical screens so that the user in effect can see a panoramic image showing a 180 or 360 degree view (or any other angular range view) of the virtual 3D video conference environment and/or a narrow field of view image focusing on part of the virtual 3D video conference environment such as a few people, one person, only part of a person, e.g., the person's face, a screen or a whiteboard or any one or more parts of the virtual 3D video conference environment.

The user will be able to control the part or parts shown in the narrow field of view image or images by using a mouse, a keyboard, a touch pad, a joystick or any other device that allows panning and zooming in or out of an image.

The user may be able to focus on a certain area in the virtual 3D video conference environment (for example a panoramic image of the virtual 3D video conference environment) by clicking on the appropriate part in the panoramic image.

FIG. 7 illustrates an example of a panoramic view 41 of the virtual 3D video conference environment populated by five participants and a partial view 42 of some of the participants within the virtual 3D video conference environment. FIG. 7 also illustrates a hybrid view 43 that includes a panoramic view (or a partial view) and expanded images of faces of some of the participants.

The user may be able to pan or zoom using head, eyes, hands, or body gestures. For example, by looking at the right or left part of the screen, the focus area may move to the left or right, so it appears at the center of the screen, and by leaning forward or backwards the focus area may zoom in or out.

The 3D model of the person's body may also assist in correctly segmenting the body and the background. In addition to the model of the body, the segmentation method will learn what objects may be connected to the body, e.g., a person may be holding a phone, pen or paper in front of the camera. These objects will be segmented together with the person and added to the image in the virtual environment, either by using a model of that object or by transmitting the image of the object based on a pixel level representation. This may be in contrast to existing virtual background methods that may be employed in existing video conferencing solutions that may not show objects held by users as these objects are not segmented together with the person but rather as part of the background that has to be replaced by the virtual background.

Segmentation methods typically use some metric that needs to be exceeded in order for pixels to be considered as belonging to the same segment. However, the segmentation method may also use other approaches, such as fuzzy logic, where the segmentation method only outputs a probability that pixels belong to the same segment. If the method detects an area of pixels whose probability leaves it unclear whether the area should be segmented as part of the foreground or as part of the background, the user may be asked how to segment this area.
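One possible way to act on such probabilities is a three-way decision with two thresholds. The sketch below is an assumption made for illustration: the threshold values `low` and `high` and the label strings are hypothetical and are not taken from the disclosure.

```python
from typing import List


def classify_areas(
    foreground_probabilities: List[float],
    low: float = 0.3,   # at or below this, treat the area as background (assumed)
    high: float = 0.7,  # at or above this, treat the area as foreground (assumed)
) -> List[str]:
    """Label each area, deferring to the user when the probability is unclear."""
    labels: List[str] = []
    for probability in foreground_probabilities:
        if probability >= high:
            labels.append("foreground")
        elif probability <= low:
            labels.append("background")
        else:
            labels.append("ask_user")
    return labels
```

Narrowing the gap between `low` and `high` reduces how often the user is interrupted, at the cost of more automatic decisions on uncertain areas.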

As part of the segmentation process, objects such as earphones, cables connected to the earphones, microphones, 3D glasses or VR headsets may be detected by a method. These objects may be removed in the modelling and rendering processes so that the image viewed by viewers does not include these objects. The option to show or eliminate such objects may be selected by users or may be determined in any other manner—for example based on selection previously made by the user, by other users, and the like.

If the method detects more than one person in the image, it may ask the user whether to include that person or people in the foreground and in the virtual 3D video conference environment or whether to segment them out of the image and outside of the virtual 3D video conference environment.

In addition to using the shape or geometrical features of objects in order to decide whether they may be part of the foreground or background, the method may also be assisted by knowledge about the temporal changes of the brightness and color of these objects. Objects that do not move or change have a higher probability of being part of the background, e.g., part of the room in which the user may be sitting, while areas where motion or temporal changes may be detected may be considered to have a higher probability of belonging to the foreground. For example, a standing lamp would not be seen as moving at all and it would be considered part of the background. A dog walking around the room would be in motion and considered part of the foreground. In some cases periodic repetitive changes or motion may be detected, for example where a fan rotates, and these areas may be considered to have a higher probability of belonging to the background.

The system will learn the preferences of the user and use the feedback regarding which objects, textures or pixels may be part of the foreground and which may be part of the background and use this knowledge in order to improve the segmentation process in the future. A learning method such as a Convolutional Neural Network or other machine learning method may learn what objects may be typically chosen by users as parts of the foreground and what objects may be typically chosen by users as part of the background and use this knowledge to improve the segmentation method.

The processing of this system may be performed on the user's device such as a computer, a phone or a tablet or on a remote computer such as a server on the cloud. The computations may also be divided and/or shared between the user's device and a remote computer, or they may be performed on the user's device for users with appropriate hardware and on the cloud (or in any other computation environment) for other users.

The estimation of the body and head parameters may be done based on compressed or uncompressed images. Specifically, they can be performed on compressed video on a remote computer such as a central computer on the cloud or another user's device. This would allow normal video conferencing systems to send compressed video to the cloud or another user's computer where all the modelling, rendering and processing would be performed.

Gaze Detection in Video Conferencing

Video conferencing is a leading method for executing meetings of all kinds. This is especially true with the globalization of working environments and has been enhanced with the appearance of the Covid-19 virus.

With the increase of importance of video conferencing systems, new methods of implementing them are being introduced. These include 3D environments, where the video conference appears to be held in a virtual setting. The participants also appear as 3D figures within the virtual environment, usually represented as avatars. In order for this kind of system to give participants a sensation of a real face-to-face meeting, it is important to understand where each participant is looking and to have the avatar look at the same place and with the same head orientation and movements as detailed below.

Prior art solutions are limited to understanding of where viewers look at the screen.

DOF—Degrees of Freedom

6 DOF—relative to a coordinate system, a person's head can have 6 degrees of freedom. Three of these are the X, Y and Z location of a predefined point in the head (e.g., the tip of the nose or the right extreme point of one of the eyes, etc.) The other three degrees of freedom are rotations around these axes. These are often known as Pitch, Yaw and Roll.

8 DOF—in addition to the 6 DOF, there are two additional degrees of freedom that help define a person's gaze. These additional degrees of freedom are necessary because the eyes do not necessarily look directly forward at all times.

Therefore, one needs to add two rotations of the eyes (Pitch and Yaw). In the most general case, one can say that each eye will have different values for these parameters.

Therefore, the most accurate description would actually be 10 DOF but for the sake of the document, only 8 DOF will be dealt with. In case a person looks at objects that are not in the immediate vicinity of the eyes, one can assume that both eyes have the same values for these parameters. The reduction from 10 DOF to 8 DOF can be done by averaging the values for both eyes or by taking the values of only one of the eyes. All that is written below can be applied to 10 DOF models.
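The reduction from 10 DOF to 8 DOF can be sketched as below. The tuple layout (X, Y, Z, pitch, yaw, roll, eye pitch, eye yaw) and the function name are assumptions made for the example.

```python
from typing import Tuple

EyeAngles = Tuple[float, float]  # (pitch, yaw) of one eye, in degrees
Head6Dof = Tuple[float, float, float, float, float, float]  # X, Y, Z, pitch, yaw, roll


def reduce_to_8dof(
    head: Head6Dof,
    left_eye: EyeAngles,
    right_eye: EyeAngles,
    mode: str = "average",
) -> Tuple[float, ...]:
    """Collapse per-eye angles into one (pitch, yaw) pair appended to the 6 DOF."""
    if mode == "average":
        eye = (
            (left_eye[0] + right_eye[0]) / 2.0,
            (left_eye[1] + right_eye[1]) / 2.0,
        )
    elif mode == "left":
        eye = left_eye
    elif mode == "right":
        eye = right_eye
    else:
        raise ValueError("mode must be 'average', 'left' or 'right'")
    return tuple(head) + eye
```

The `average` mode corresponds to averaging the values of both eyes, and the `left` and `right` modes correspond to taking the values of only one eye.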

There are known methods for determining where on the screen the participant is looking. See, for example, http://developer.tobiipro.com/commonconcepts/calibration.html

Solutions such as these only deal with understanding at which point on the screen the viewer is looking. They are accomplished by calibrating the sight of the viewer, as seen by the camera, with known coordinates of the screen.

Information about the screen size, or specifically the size of the window that is viewed by the viewer can be supplied by all operating systems or can be inferred by information about the screen size and window attributes within the screen.

In order to calculate the line of sight, one needs to find the 8 DOF parameters of the participant and combine them with the point on the screen at which the participant is looking.

The 6 DOF parameters can be obtained in the following manner: X and Y are relative to the camera's coordinates. Z can be obtained by one of the following methods:

a. For calibration purposes, ask the participant to sit at a defined distance from the camera. This is a one-time process. Following this, Z can be calculated by changes in the size of the head as viewed by the camera.
b. Use a depth camera. These are more and more ubiquitous nowadays.
c. Infer the participant's distance from the camera by the size of the participant's head as captured by the camera and compared to an average human's head size. Average numbers can be obtained, for example, here: https://en.wikipedia.org/wiki/Human_head
d. Assume that the participant is located at a certain distance from the camera (e.g., 55 cm).
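Options a and c both rely on the apparent head size shrinking in inverse proportion to distance. The sketch below assumes an ideal pinhole camera; the function names, the focal length expressed in pixels, and the 15 cm default head width are assumptions introduced for the example and are not values from the disclosure.

```python
def distance_from_reference(
    ref_distance_cm: float, ref_head_px: float, head_px: float
) -> float:
    """Option a: scale a calibrated distance by the change in apparent head size."""
    # Under a pinhole model, apparent size is inversely proportional to distance.
    return ref_distance_cm * ref_head_px / head_px


def distance_from_average_head(
    focal_length_px: float,
    head_px: float,
    avg_head_width_cm: float = 15.0,  # placeholder average, assumed for illustration
) -> float:
    """Option c: infer distance from an assumed average physical head width."""
    return focal_length_px * avg_head_width_cm / head_px
```

For instance, if the head measured 200 pixels at the calibrated 55 cm and now measures 100 pixels, the first function places the participant at twice the calibrated distance.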

The three additional DOF are then easily obtained. This involves finding the Euclidean matrix which describes the movement of the head and is well known in computer graphics and in other areas.

The additional DOF for the eyes can then be found by comparing the pupil locations relative to the center of the eyes.

In order to calculate the line of sight, one assumes a virtual pinhole camera (VCV) located at the geometrical point which is on the participant's face between the participant's eyes. A line is then calculated which joins that virtual camera with the point on the screen the viewer is looking at. Note that, since we are dealing with a virtual 3D video conferencing setting, this virtual camera is also used as a virtual camera (VCP) when deciding what to present to the viewer on the viewer's screen from within the 3D environment. Therefore, the line of sight is also the line of sight within the 3D environment. Under some circumstances, and in order to reduce the amount of changes of what is presented to the viewer, VCP may be less prone to movements than VCV and may be located at a slightly different location. Even in these cases, the location of VCP is known and it is straightforward to translate the viewer's line of sight from VCV coordinates to a line of sight in the VCP coordinates.

Finding the line of sight is followed by determining what the viewer is looking at. This can be answered by finding the opaque object that is closest to VCV along the line of sight. In order to reduce possible miscalculations, it may be possible to assume that the viewer is looking at a face along or closest to the line of sight.
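Finding the closest opaque object along the line of sight is a ray casting problem. The sketch below approximates every object by a bounding sphere, which is a simplification chosen for the example; the `Sphere` type and the function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Sphere:
    name: str
    center: Vec3
    radius: float


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def hit_distance(origin: Vec3, unit_dir: Vec3, sphere: Sphere) -> Optional[float]:
    """Distance along the ray to the sphere, or None if the ray misses it."""
    to_center = tuple(c - o for c, o in zip(sphere.center, origin))
    along = _dot(to_center, unit_dir)  # projection of the center onto the ray
    off_axis_sq = _dot(to_center, to_center) - along * along
    if off_axis_sq > sphere.radius ** 2:
        return None
    half_chord = math.sqrt(sphere.radius ** 2 - off_axis_sq)
    distance = along - half_chord
    if distance < 0.0:
        distance = along + half_chord  # the ray starts inside the sphere
    return distance if distance >= 0.0 else None


def gazed_object(origin: Vec3, direction: Vec3, objects: List[Sphere]) -> Optional[Sphere]:
    """Return the object hit first by the line of sight, if any."""
    norm = math.sqrt(_dot(direction, direction))
    unit_dir = tuple(c / norm for c in direction)
    best: Optional[Sphere] = None
    best_distance = math.inf
    for obj in objects:
        distance = hit_distance(origin, unit_dir, obj)
        if distance is not None and distance < best_distance:
            best, best_distance = obj, distance
    return best
```

Here `origin` plays the role of the VCV position and `direction` the line of sight; a production system would intersect the actual scene geometry rather than spheres.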

FIG. 8 illustrates an example where the first and second cameras VCV 4421 and VCP 4422 are mutually aligned. VCV 4421 has an optical axis 4426 and a VCV field of view 4423. VCP 4422 has an optical axis 4427 and a VCP field of view 4424.

FIG. 8 also illustrates avatars 4401, 4402, 4403, 4404 and 4405 of five participants within the virtual 3D video conference environment (V3DVCE), the display 4425 (as virtually appears in the V3DVCE), and an image 4410 that illustrates the displayed representation of the V3DVCE in which the five avatars are illustrated.

FIG. 9 illustrates an example where the first and second cameras VCV 4421 and VCP 4422 are misaligned, with their optical axes and fields of view spaced apart but parallel to each other.

FIG. 10 illustrates an example where the first and second cameras VCV 4421 and VCP 4422 are misaligned, with their optical axes and fields of view spaced apart and non-parallel to each other. Misalignment between the VCV and the VCP may occur. Misalignment between them may mean that the optical axis of the VCV is misaligned with the optical axis of the VCP. When this occurs, a gaze related object (or a part of said gaze related object) that can be fully seen by one camera may be at least partially concealed from the other camera.

FIG. 11 illustrates an example where the first and second cameras VCV 4421 and VCP 4422 are misaligned, with their optical axes and fields of view spaced apart and non-parallel to each other. An avatar 4405 (an example of a gaze related object) is fully seen by VCP 4422 and is partially concealed from the other camera (part 4405′ is not shown) due to the presence of another gaze related object 4408.

FIG. 12 illustrates a participant 4451 that looks, at three different points of time, at points 4461, 4462 and 4463 within the V3DVCE. Point 4462 is positioned on the face of avatar 4451, and points 4461 and 4463 are located in proximity to avatar 4451, so the method may conclude that the participant looks at the avatar 4451.

FIG. 13 illustrates different points of gaze of a participant (appearing as points 4465 and 4466) on an avatar 4451 of another participant. The gaze of the participant oscillates between points 4465 and 4466 (on avatar 4451). By applying a temporal filter and/or smoothing, the gaze may be set (for purposes of generating the next displayed version of the V3DVCE) on one point, being either one of the points or an average point between points 4465 and 4466.

FIG. 14 illustrates method 4500 for virtual 3D video conference environment generation.

Method 4500 may start by steps 4510 and 4520.

Step 4510 may include determining a first optical axis of a first virtual camera. The first optical axis represents a line of sight of a participant of the 3D video conference while the participant looks at a current displayed version of a virtual 3D video conference environment (V3DVCE). The current displayed version of the V3DVCE is displayed on a display.

The first virtual camera may be virtually positioned at a geometrical point between both eyes of a participant and on a face of the participant.

Step 4510 may include at least one out of:

a. Applying a temporal filter on multiple intermediate determinations of the first optical axis, made during a certain time period.
b. Applying a smoothing operation on multiple intermediate determinations of the first optical axis, made during a certain time period.
c. Applying a temporal filter on multiple intermediate determinations of a second optical axis, made during the certain time period.

Step 4520 may include determining a second optical axis of a second virtual camera that virtually captures the V3DVCE to provide the current displayed version of the V3DVCE.

The V3DVCE may be displayed in correspondence to the second optical axis.

Steps 4510 and 4520 may be followed by step 4530 of generating a next displayed version of the V3DVCE based on at least one of the first optical axis and the second optical axis.

Steps 4510, 4520 and 4530 may be repeated multiple times, for example during the duration of the 3D video conference. Steps 4510, 4520 and 4530 may be repeated every video frame, every several video frames, one to tens of times per second, once per second, once every several seconds, and the like.

Step 4530 may include at least one out of:

a. Comparing the second optical axis to the estimate of the line of sight of the participant within V3DVCE. The line of sight may have a first part outside the display.
b. The comparing may include calculating an estimate of the second optical axis outside the display.
c. Comparing the line of sight to the estimate of the second optical axis outside the display.
d. Determining an intersection pixel of the display that intersects with the first optical axis.
e. Searching for a potential object of interest that is virtually positioned within the V3DVCE in proximity to the line of sight within the V3DVCE, and determining a content of the next displayed version based on the potential object of interest. The potential object of interest may include an avatar. The potential object of interest may not be intersected by the line of sight.
f. Virtually amending the line of sight to virtually intersect with the potential object of interest.
g. Determining one or more gaze related objects. A gaze related object is an object that is located within a field of view of the participant, as represented by the direction of gaze of the participant.
h. Determining whether a gaze related object of the one or more gaze related objects at least partially conceals another gaze related object of the one or more gaze related objects.

There may be an angular difference between the first optical axis and the second optical axis. The estimate of the first optical axis in the V3DVCE is an angular difference compensated estimate of the line of sight within the V3DVCE. Step 4530 may include compensating for an angular difference between the first optical axis and the second optical axis.
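Items e and f of step 4530 (searching for an object of interest near the line of sight and amending the line of sight to intersect it) might be sketched as follows. The angular threshold `max_angle_deg`, the dictionary of candidate positions and the function names are assumptions made for the example, and each candidate is assumed not to coincide with the origin.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _angle_deg(u: Vec3, v: Vec3) -> float:
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def snap_line_of_sight(
    origin: Vec3,
    direction: Vec3,
    candidates: Dict[str, Vec3],
    max_angle_deg: float = 10.0,  # assumed proximity threshold
) -> Optional[Tuple[str, Vec3]]:
    """Pick the candidate nearest in angle to the line of sight and re-aim at it."""
    best_name: Optional[str] = None
    best_angle = max_angle_deg
    best_direction: Optional[Vec3] = None
    for name, position in candidates.items():
        to_object = tuple(p - o for p, o in zip(position, origin))
        angle = _angle_deg(direction, to_object)
        if angle <= best_angle:
            best_name, best_angle, best_direction = name, angle, to_object
    if best_name is None or best_direction is None:
        return None  # nothing close enough; keep the measured line of sight
    return best_name, best_direction
```

The returned direction is the amended line of sight, which now passes through the selected object even if the measured line of sight did not intersect it.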

The one or more gaze related objects may include:

a. At least one object that intersects with the estimate of the first optical axis in the V3DVCE.
b. At least one object that is a face of an avatar of a participant that is located in proximity to the estimate of the first optical axis in the V3DVCE.
c. At least one object of interest within the V3DVCE.

In the foregoing specification, the embodiments of the disclosure have been described with reference to specific examples of embodiments of the disclosure. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the embodiments of the disclosure as set forth in the appended claims.

Moving avatars in a virtual 3D video conference environment (V3DVCE).

Examples of providing and maintaining a V3DVCE were illustrated in at least one out of (a) U.S. patent application Ser. No. 17/249,468 filing date Mar. 2, 2021, (b) U.S. provisional patent Ser. No. 63/023,836 filing date May 12, 2020, (c) U.S. provisional patent Ser. No. 63/081,860 filing date Sep. 22, 2020, (d) U.S. provisional patent Ser. No. 63/199,014 filing date Dec. 1, 2020, (e) U.S. patent application Ser. No. 17/304,378 filing date Jun. 20, 2021, or (f) U.S. patent application Ser. No. 17/539,036 filing date Nov. 30, 2021, all being incorporated herein by reference. Other V3DVCEs may be provided.

There is a growing need to represent participants in the V3DVCE in a reliable manner—especially while performing natural moves so that the setting fully recreates a feeling of viewing real people interacting within the virtual conference.

When creating avatars and having them move and replicate the human behaviour, it is beneficial to have them perform movements which seem natural.

The real participants of the virtual conference would be typically seated in front of a screen or a set of screens and would therefore perform only slight physical movements when moving their gaze from one point to another—for example from gazing at one participant to another participant—as both would appear to the participant within the screen and, therefore, within a small angular difference.

Such a setting is shown in FIG. 15A. The left side of the figure shows two participant avatars (Second avatar 5342 and Third avatar 5343) as shown on the screen 5370 of First participant 5361. The right side of the figure shows a view from the top of First participant 5361 and the screen 5370. As one can see, there is a relatively small angle between a (displayed) location 5352 of Second avatar 5342 and a (displayed) location 5353 of Third avatar 5343 as seen by First participant 5361.

Suppose that within the virtual environment in which the conference virtually takes place the participant avatars are located as seen in FIG. 15B.

In FIG. 15B, one can see that the angular difference between Second avatar 5342 and Third avatar 5343 as seen by First avatar 5341 is much larger than as they appear to be seen by First participant 5361 in FIG. 15A.
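The contrast between the two angles can be made concrete with a small top-view calculation. The coordinates used in the usage lines are invented for illustration (a viewer 60 cm from the screen with the two displayed locations 20 cm apart, and avatars placed diagonally to either side in the virtual room); they are not dimensions from the figures.

```python
import math
from typing import Tuple

Point2D = Tuple[float, float]


def angle_between_deg(viewer: Point2D, first: Point2D, second: Point2D) -> float:
    """Unsigned angle, seen from the viewer, between two points in a top view."""
    ax, ay = first[0] - viewer[0], first[1] - viewer[1]
    bx, by = second[0] - viewer[0], second[1] - viewer[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return abs(math.degrees(math.atan2(cross, dot)))


# Hypothetical screen geometry (centimetres): a small angle, below 20 degrees.
on_screen = angle_between_deg((0.0, 0.0), (-10.0, 60.0), (10.0, 60.0))

# Hypothetical virtual room geometry (metres): a right angle.
in_room = angle_between_deg((0.0, 0.0), (-1.0, 1.0), (1.0, 1.0))
```

With these invented numbers the avatar would have to turn its gaze several times further than the participant's eyes actually moved.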

Suppose First participant 5361, while looking at the screen, shifts his or her gaze from Second avatar 5342 to Third avatar 5343. Since they both appear on the screen, First participant 5361 can do this by a slight movement of the eyes, or may accompany this by an equally slight movement of the head. However, reflecting this change of gaze within the virtual environment, by having First avatar 5341 change its gaze from Second avatar 5342 to Third avatar 5343, would appear very unnatural if done only by a movement of the first avatar's eyes and perhaps a slight movement of its head. A different kind of adjustment is needed.

While this description discusses a lateral movement, it should be understood that it is correct for a movement on any axis or combination of axes.

In general, a lateral movement of a figure's gaze can be performed by one, or a combination of two to four, of the following movements: movement of the body, movement of the upper torso, movement of the head, and movement of the eyes.

Let the angular movement of the body be denoted by b, that of the torso by t, that of the head by h and that of the eyes by e.

In this case, the sum s=(b+t+h+e) is equal to the movement of the gaze of the avatar.

So, if the avatar is required to move its gaze by an angle a, a combination of b, t, h and e should be chosen so that their sum, s, equals a and so that they create a natural movement, one that looks like a movement humans perform.
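To make the constraint s = b + t + h + e = a concrete, the sketch below splits a requested gaze angle with a hand-written rule: eyes first, then head, then torso, with the body absorbing the remainder. This rule and its per-component limits are placeholders invented for the example; they only illustrate the sum constraint and are not a model of natural human movement.

```python
from typing import Dict, Sequence, Tuple

# Assumed comfortable ranges in degrees; illustrative placeholders only.
DEFAULT_LIMITS: Tuple[Tuple[str, float], ...] = (
    ("e", 15.0),  # eyes
    ("h", 45.0),  # head
    ("t", 30.0),  # torso
)


def split_gaze(
    a_deg: float, limits: Sequence[Tuple[str, float]] = DEFAULT_LIMITS
) -> Dict[str, float]:
    """Split gaze angle a into e, h, t and b so that b + t + h + e equals a."""
    sign = 1.0 if a_deg >= 0.0 else -1.0
    remaining = abs(a_deg)
    parts: Dict[str, float] = {}
    for name, limit in limits:
        used = min(remaining, limit)
        parts[name] = sign * used
        remaining -= used
    parts["b"] = sign * remaining  # the body takes whatever is left over
    return parts
```

Whatever rule or model produces the split, the same check applies: the four components must add up to the requested angle.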

In order to find out what constitutes a normal and natural movement of humans, machine learning techniques can be employed.

For example, a machine learning process can be trained to learn how lateral movements of humans are separated into these four components. Such training can be performed for each specific individual and later applied to the avatar that reflects that individual. Alternatively, a network can be trained on a large population and apply its processing to any participating avatar.

Choosing the population for training may be based on common feature or features of the population—such as gender, height, age, etc. In this case, the avatar's features would determine how to apply the learned breakup of the movements to it. Additionally, this can be performed without segmenting the population at all.

FIG. 15C illustrates an example of method 5400 for changing a direction of gaze of an avatar within a virtual environment of a video conference.

Method 5400 may start by step 5410 of sensing that a certain participant of a video conference changes his direction of gaze from a first virtual item to a second virtual item. The sensing may involve image processing of images acquired by a camera that captures the participant.

The first virtual item and the second virtual item appear in a version of a virtual video conference environment that is displayed on a display of the certain participant.

The first virtual item may be a first participant. The second virtual item may be a second participant. The first virtual item and/or the second virtual item may differ from any of the participants of the virtual video conference.

Step 5410 may be followed by step 5420 of determining, by a machine learning process, at least one virtual movement of a certain participant avatar within the virtual environment that once applied shifts a direction of gaze within the virtual environment, of the certain participant avatar from the first virtual item to the second virtual item.

The certain participant avatar represents the certain participant.

The at least one virtual movement is selected out of (a) a virtual movement of eyes of the certain participant avatar, (b) a virtual movement of a head of the certain participant avatar, (c) a virtual movement of a torso of the certain participant avatar, or (d) a virtual movement of a body of the certain participant avatar.

Step 5420 may be followed by step 5430 of responding to the determining.

The responding may include at least one out of:

a. Implementing the at least one movement, for example moving the certain participant avatar according to the determination.
b. Transmitting information regarding the determining to one or more other computerized systems that are involved in the 3D video conference. The computerized system may be a user device, a remote computerized system and the like.

The machine learning process may undergo a certain participant (also referred to as an individual) training process during which the machine learning process is trained by a training process that comprises receiving videos of the certain participant while performing at least one actual movement; wherein the at least one actual movement is selected out of (a) an actual movement of eyes of the certain participant, (b) an actual movement of a head of the certain participant, (c) an actual movement of a torso of the certain participant, or (d) an actual movement of a body of the certain participant.

Alternatively, the machine learning process may undergo a non-specific training process during which the machine learning process is trained by a training process that comprises receiving videos of multiple persons while performing at least one actual movement; wherein the at least one actual movement is selected out of (a) an actual movement of eyes of a person, (b) an actual movement of a head of a person, (c) an actual movement of a torso of a person, or (d) an actual movement of a body of a person.

FIG. 15D illustrates an example of method 5450 for changing a direction of gaze of an avatar within a virtual environment of a video conference.

Method 5450 may start by step 5410 of sensing that a certain participant of a video conference changes his direction of gaze from a first virtual item to a second virtual item. The sensing may involve image processing of images acquired by a camera that captures the participant.

The first virtual item and the second virtual item appear in a version of a virtual video conference environment that is displayed on a display of the certain participant.

The first virtual item may be a first participant. The second virtual item may be a second participant. The first virtual item and/or the second virtual item may differ from any of the participants of the virtual video conference.

Step 5410 may be followed by step 5460 of determining, by a machine learning process, at least one virtual movement of a certain participant avatar within the virtual environment that, once applied, shifts a direction of gaze of the certain participant avatar within the virtual environment from the first virtual item to the second virtual item.

The at least one virtual movement may include at least one out of (a) a virtual movement of eyes of the certain participant avatar, or (b) one or more additional virtual movements that differ from the virtual movement of eyes of the certain participant avatar.

The at least one virtual movement may include the virtual movement of eyes of the certain participant avatar and one or more additional virtual movements.

The one or more additional virtual movements may include a combination of one or more of (a) a virtual movement of a torso of the certain participant avatar, (b) a virtual movement of a head of the certain participant avatar, or (c) a virtual movement of a body of the certain participant avatar.
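As an illustration only, the combination of a virtual eye movement with an additional virtual head movement can be pictured geometrically. The sketch below is not the machine learning process of step 5460: the function name `split_gaze_shift`, the top-down coordinate convention and the 30 degree eye limit are assumptions introduced for this example.

```python
import math

def split_gaze_shift(avatar_xz, current_yaw_deg, target_xz, eye_limit_deg=30.0):
    """Return (eye_yaw_deg, head_yaw_deg) that together turn the gaze to the target.

    Top-down view: yaw 0 looks along +z, positive yaw turns toward +x.
    eye_limit_deg is a hypothetical cap on how far the eyes alone may rotate.
    """
    dx = target_xz[0] - avatar_xz[0]
    dz = target_xz[1] - avatar_xz[1]
    desired_yaw = math.degrees(math.atan2(dx, dz))
    # Wrap the required change of yaw into the range [-180, 180).
    delta = (desired_yaw - current_yaw_deg + 180.0) % 360.0 - 180.0
    # The eyes absorb as much of the change as the cap allows.
    eye = max(-eye_limit_deg, min(eye_limit_deg, delta))
    # The remainder is assigned to the head (an "additional virtual movement").
    head = delta - eye
    return eye, head
```

In this sketch a small shift is handled by the eyes alone, while a larger shift adds a head rotation. A torso or body rotation could be layered on in the same way.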

Step 5460 may be followed by step 5470 of responding to the determining.

The responding may include at least one out of:

a. Implementing the at least one movement, for example moving the certain participant avatar according to the determination.
b. Transmitting information regarding the determining to one or more other computerized systems that are involved in the 3D video conference. The computerized system may be a user device, a remote computerized system and the like.

The machine learning process may have been trained by a training process that includes receiving videos of the certain participant while performing at least one actual movement; wherein the at least one actual movement comprises at least one out of (a) an actual movement of eyes of the certain participant, or (b) one or more additional actual movements that differ from the actual movement of eyes of the certain participant.

The machine learning process may have been trained by a training process that comprises receiving videos of multiple persons while performing at least one actual movement; wherein the at least one actual movement comprises at least one out of (a) an actual movement of eyes of the multiple persons, or (b) one or more additional actual movements that differ from the actual movement of eyes of the multiple persons.

There may be a difference between the at least one actual movement of the certain participant that led to the change of the direction of gaze and the at least one virtual movement of the certain participant avatar.

The determining of the at least one virtual movement of the certain participant avatar can be made regardless of the at least one actual movement of the certain participant.

Alpha Channel Information

The creation of credible avatars within virtual environments requires a good blending of the avatar image with the virtual background.

Specifically, the problem mostly manifests itself at the hair of the avatar. Hair typically is very thin and practically takes less than one pixel in the created image. Therefore, in order to make it appear real, the color of the hair is blended with the color of the background. Blending is done according to the following formula:

P=alpha*hair_color+(1−alpha)*background_color

Where P is a three channel color (for example Red, Green and Blue) of a pixel, hair_color is the three channel color of the hair in pixel P, background_color is the three channel color of the background in pixel P and alpha is a channel transparency map—referred to as an alpha channel.

As can be seen from the above equation, each pixel is composed of an amount of hair color and an amount of background color. The ratio between the hair color amount and the background color amount is determined by the transparency channel. The transparency channel is a grayscale image containing floating point values between 0 and 1. A value of 0 means that the pixel gets only the color of the background while a value of 1 gives the pixel only the color of the hair. Any value between 0 and 1 mixes the colors of hair and background.
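The blending formula above can be illustrated with a minimal per-pixel sketch. The function name `blend_pixel` and the use of plain tuples for three channel colors are assumptions made for the example.

```python
def blend_pixel(alpha, hair_color, background_color):
    """Apply P = alpha*hair_color + (1 - alpha)*background_color per channel.

    alpha is a float between 0 and 1; colors are (R, G, B) tuples.
    """
    return tuple(alpha * h + (1.0 - alpha) * b
                 for h, b in zip(hair_color, background_color))

# alpha = 0 keeps only the background, alpha = 1 keeps only the hair,
# and intermediate values mix the two colors.
black_hair = (0.0, 0.0, 0.0)
background = (200.0, 100.0, 40.0)
mixed = blend_pixel(0.25, black_hair, background)  # (150.0, 75.0, 30.0)
```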

The values of hair_color and alpha depend on different parameters such as direction, pose and shape of the avatar as well as other such parameters. A machine learning process designed to render the avatar and the background needs to be trained in a way that would make the rendering appear as accurate and as real as possible.

There is a growing need to provide a solution that may generate alpha channel information in an effective manner—especially in real time and under a variety of person properties—such as pose, direction of gaze and expression.

The solution may be used for generating an avatar of a person—for example—during 3D virtual conference calls—but this is not necessarily so.

The solution should be accurate, simple (consume limited computational resources), and may provide alpha channel information in real time and/or under different person properties such as pose, direction and expression.

FIG. 16A illustrates an example of method 5200 for generating ground-truth alpha channel information.

The ground-truth alpha channel information can be used to train an alpha channel machine learning process. The training results in accurate generation of alpha channel information by the alpha channel machine learning process.

Method 5200 may start by step 5210 of obtaining input images of the person.

Each image may include the person and an arbitrary background. The arbitrary background may include any compositions of objects and/or colors and/or shapes.

Step 5210 may be followed by step 5220 of converting each input image of the input images, by a portrait matting predictor, to (a) a first colored background image that includes the person (for example as acquired in the input image) and a background of a first color, and (b) a second colored background image that includes the person (for example as acquired in the input image) and a background of a second color; wherein the first color differs from the second color.

The portrait matting predictor may be a matting objective decomposition network (MODNET)—see, for example, “Is a Green Screen Really Necessary for Real-Time Portrait Matting?” Ke, Li, Zhou, Wu, Mao, Yan, and Lau, arXiv:2011.11961v2 [cs.CV] 29 Nov. 2020.

Thus—for each input image two ground-truth images are generated—the first colored background image and the second colored background image.

The portrait matting predictor may be a robust video matting (RVM) predictor that may process a video stream and convert input images of the person (for example input images of a video stream) to multiple ground truth images at once. The RVM may process one input image at a time or process multiple input images at once.

Step 5220 may require a limited amount of resources by calculating only the above-mentioned two ground truth images per input image. It has been found that using two sets of identical images with different backgrounds is enough to provide a robust and accurate solution.

Step 5220 generates ground truth alpha channel information for each ground-truth image.

FIG. 16B illustrates an example of method 5250 for training an alpha channel machine learning process.

Method 5250 may start by steps 5210 and 5260.

Step 5210 may include obtaining input images of the person. Each image may include the person and an arbitrary background.

Step 5260 may include obtaining ground truth alpha channel information related to each ground-truth image. The ground truth alpha channel information may be generated during method 5200.

Steps 5210 and 5260 may be followed by an iterative training process denoted 5270.

Iterative process 5270 may include:

a. Step 5271 of determining person properties within an input image of the input images of step 5210. The input image may be selected in any manner.
b. Step 5272 of generating, by the alpha channel neural network, a first image, the first image is of the first avatar of the person with the first colored background and having the person properties.
c. Step 5273 of comparing (a) first alpha channel information related to the first image to (b) ground-truth alpha channel information related to the first colored background image to provide a first comparison result.
d. Step 5274 of generating, by the alpha channel neural network, a second image, the second image is of the second avatar of the person with the second colored background and having the person properties.
e. Step 5275 of comparing (a) second alpha channel information related to the second image to (b) ground-truth alpha channel information related to the second colored background image to provide a second comparison result.
f. Step 5276 of using the first comparison result and the second comparison result to correct the machine learning process. For example, the first comparison result and the second comparison result can be fed to a loss function, can be fed to a self-learning machine learning process and the like.
g. Selecting another input image and jumping to step 5271.
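As a hedged illustration of steps 5273, 5275 and 5276, the two comparison results may be computed as per-pixel differences and fed to a single loss value. The sketch below assumes that alpha channel information is a 2D list of floats between 0 and 1. The mean absolute difference and the simple sum are choices made for this example only; the text above does not specify the loss function.

```python
def mean_abs_diff(predicted_alpha, ground_truth_alpha):
    """Mean absolute per-pixel difference between two alpha maps of equal size."""
    total = 0.0
    count = 0
    for pred_row, truth_row in zip(predicted_alpha, ground_truth_alpha):
        for p, t in zip(pred_row, truth_row):
            total += abs(p - t)
            count += 1
    return total / count

def combined_alpha_loss(first_pred, first_truth, second_pred, second_truth):
    """Combine the first and second comparison results into one loss value."""
    first_result = mean_abs_diff(first_pred, first_truth)     # as in step 5273
    second_result = mean_abs_diff(second_pred, second_truth)  # as in step 5275
    return first_result + second_result                       # used in step 5276
```

A training framework would then use this value to update the alpha channel machine learning process before the next input image is selected.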

FIG. 16C illustrates an example of method 5300 for generating alpha channel information related to a person.

Method 5300 may start by step 5310 of obtaining an alpha channel machine learning process. The alpha channel machine learning process may have been trained using method 5250.

Step 5310 may be followed by step 5320 of receiving images of the person. The images may be a part of a video stream and may be acquired during a 3D video conference call.

Step 5320 may be followed by step 5330 of generating, by the alpha channel machine learning process, alpha channel information related to the person.

The generating may be accurate, may be executed in real time and may be resource efficient.

Step 5330 may be followed by step 5340 of utilizing the alpha channel information related to the person.

The utilizing may include at least one out of:

a. Generating an avatar of the person. This may require additional information and/or data structures such as a 3D model and one or more texture maps.
b. Transmitting the alpha channel information related to the person, especially the alpha channel information regarding multiple images acquired during the 3D video conference, to one or more other computerized systems that are involved in the 3D video conference. The computerized system may be a user device, a remote computerized system and the like.
c. Virtually including the avatar in a virtual 3D video conference environment (V3DVCE).

Method 5300 may include generating an avatar or may be added to any method of generating an avatar—for example any method illustrated in at least one out of (a) U.S. patent application Ser. No. 17/249,468 filing date Mar. 2, 2021, (b) U.S. provisional patent Ser. No. 63/023,836 filing date May 12, 2020, (c) U.S. provisional patent Ser. No. 63/081,860 filing date Sep. 22, 2020, (d) U.S. provisional patent Ser. No. 63/199,014 filing date Dec. 1, 2020, (e) U.S. patent application Ser. No. 17/304,378 filing date Jun. 20, 2021, or (f) U.S. patent application Ser. No. 17/539,036 filing date Nov. 30, 2021, all being incorporated herein by reference.

Three-Dimensional Modeling and Rendering of Hair

With the increase of importance of video conferencing systems, new methods of implementing them are being introduced. These include 3D environments, where the video conference appears to be held in a virtual setting. The participants also appear as 3D figures within the virtual environment, usually represented as avatars. It is important to render avatars which appear as natural as possible and have them appear in the virtual environment just like real people in real environments.

An important element of rendering natural avatars includes accurate modeling and rendering of hair. Hair is different than other parts of the body as hair has micro-scale structures and large numbers of repeated strands with heavy occlusions. Also, local plane similarity, which is a common assumption in general 3D reconstruction, is not valid for hair. Modeling and rendering of hair should be performed in a cost-effective way as otherwise they can't be useful for real-time purposes.

Hair may be modeled as a mesh, a set of wisps or by a high density strand representation.

A mesh is a collection of vertices, edges and faces which describe 3D objects. An object modelled this way appears as one rigid object, as described in C. Yuksel, S. Schaefer, J. Keyser. “Hair meshes”. ACM Transactions on Graphics (TOG) 28 (5), 1-7, 2009. Images 6001 and 6002 are taken from said article. These images provide an unnatural appearance and could be likened to helmets. Image 6001 of FIG. 17 illustrates an example of a hair mesh model and image 6002 of FIG. 17 illustrates a final hair model generated using the hair mesh and procedural styling operations.

Sets of wisps: each wisp is a small collection of strands, where each strand is composed of sets of points connected by straight lines. Images 6003 and 6004 of FIG. 17 are taken from L. Luo, H. Li, and S. Rusinkiewicz. “Structure-Aware Hair Capture”. ACM Transactions on Graphics (Proc. SIGGRAPH) 32(4), 2013. While this modeling is more detailed than a mesh, it still lacks important details, especially with strands that “fly” into the air. Image 6003 illustrates a picture of hair and image 6004 of FIG. 17 illustrates an example of hair synthesized using sets of wisps.

High density strand representation. Images 6005 and 6006 of FIG. 17 are taken from G. Nam, C. Wu, M. H. Kim, and Y. Sheikh. “Strand-accurate multi-view hair capture”. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. While this method may provide a reasonable outcome, it is extremely resource and time consuming and can't be used in real-time solutions. The method requires a very expensive setup of cameras and lighting sources and does not assist a subsequent rendering process. Image 6005 illustrates a picture of hair and image 6006 of FIG. 17 illustrates an example of hair modeling based on strands.

The suggested method provides a cost-effective way to model and enable rendering of hair modeled as strands.

The description applies to each specific person whose hair is being modeled.

First, using known methods or tools, such as depth cameras or triangulation of multi-view correspondences, a 3D map of points of hair is created. Each point can have RGB (and optionally also transparency) values attributed to it. In other solutions, such points can be used as a basis for a mesh model.

The next step is to attribute directions to the points. These directions are a characteristic of the strand passing through the point. One way to perform this is to create 2D projections of the 3D hair map and to apply direction-finding operations on the images. Typically, this is performed with filters such as a Gabor filter, Sobel filter or other filters.

Image 6011 of FIG. 17A illustrates a picture of head hair. Image 6012 illustrates an output of a Gabor filter on points extracted from the image.

Once 2D directions are found in the 2D projections of the 3D map, with enough such projections, it is possible to solve the optimization problem and to find the 3D directions for each point.
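A minimal sketch of recovering a 3D direction from 2D directions is given below, under strong simplifying assumptions that are not stated in the text above: the projections are orthographic, the image axes of each view are known 3D unit vectors, the 2D directions are noise-free, and only two views are used. All function names are hypothetical. Each view constrains the 3D direction D to a plane, because the projected direction (u·D, v·D) must be parallel to the observed 2D direction (a, b), that is, a(v·D) − b(u·D) = 0. With two views, D lies on the intersection of the two planes.

```python
import math

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def view_constraint(axis_u, axis_v, dir_2d):
    """Return m such that m . D = 0 for any 3D direction D consistent with the view.

    axis_u and axis_v are the 3D directions of the image x and y axes;
    dir_2d = (a, b) is the 2D direction found in that projection.
    """
    a, b = dir_2d
    return tuple(a * v - b * u for u, v in zip(axis_u, axis_v))

def direction_3d_from_two_views(view1, dir1, view2, dir2):
    """Intersect the two constraint planes; each view is a pair (axis_u, axis_v)."""
    m1 = view_constraint(view1[0], view1[1], dir1)
    m2 = view_constraint(view2[0], view2[1], dir2)
    return normalize(cross(m1, m2))
```

The result is defined only up to sign, since a strand direction and its opposite project to the same 2D orientation. The computation also fails when both views give the same constraint plane. With more than two projections, or with noisy directions, the constraints would typically be combined in a least-squares optimization instead of a single cross product.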

The next step attempts to cluster the 3D points into strands by their proximity and their similarity in the 3D directions as found in the previous step. This, in effect, creates a 3D mapping of the strands.
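The clustering described above can be sketched as a simple grouping that joins two points when they are both close to each other and similar in direction. The function name, the thresholds `max_dist` and `min_cos`, and the use of a union-find structure are assumptions made for this example; the text above does not prescribe a specific clustering algorithm.

```python
import math

def group_into_strands(points, directions, max_dist=1.0, min_cos=0.9):
    """Group point indices into strand candidates.

    points: list of (x, y, z); directions: list of unit 3D vectors.
    Two points are linked when their distance is at most max_dist and the
    absolute cosine between their directions is at least min_cos.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(points[i], points[j]) <= max_dist
            # abs() because a direction and its opposite describe the same strand.
            cos = abs(sum(a * b for a, b in zip(directions[i], directions[j])))
            if close and cos >= min_cos:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The pairwise loop is quadratic in the number of points. A practical implementation would likely use a spatial index to examine only nearby points.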

An example can be seen in graph 6013 of FIG. 17A. Strand segments are denoted 6014 while an estimated strand is denoted 6015.

At this point, the spatial location of the strands is known in the sense that one knows which points are connected to each other.

At the next step, a neural network (NN) is trained to create realistic looking hair of the person that is being modeled. Input to the NN can be the model created with some volume characteristics added to the modeled strands. For example, each strand segment can be characterized as a long cylinder or a truncated cone. FIG. 17B shows a virtual image 6020 with one such segment 6022 characterized as a long cylinder and overlaid over a model of a head 6021.

Additional input to the NN includes the extrinsic and intrinsic parameters of the camera. Other possible inputs may include lighting or other parameters. The NN creates views of the hair at given angles and distances. As ground truth the NN can receive images captured of the specific person whose hair is being modeled or synthetic images that resemble the person's hair. The comparison to the ground truth can be used to verify occlusions, color, reflection and any other phenomena associated with images of hair. Once trained, the NN can be used to render hair as viewed from arbitrary angles and distances to appear natural and real.

This process is useful both for modeling hair accurately and in fine detail and for rendering hair for many purposes.

FIG. 17C illustrates an example of method 6030 for three dimensional modeling and rendering of head hair.

Method 6030 may start by step 6032 of obtaining a three dimensional (3D) point cloud of head hair points of a person.

Step 6032 may be followed by step 6034 of determining head hair points related 3D directions indicative of 3D directions of candidate strands that pass through the head hair points.

The head hair points of the 3D point cloud are associated with color and transparency. Thus, a head hair point has its color value (for example RGB values) and transparency value.

Step 6034 may include applying a Gabor filter or a Sobel filter.

Step 6034 may include:

a. Step 6034(1) of creating multiple two dimensional (2D) projections of the 3D point cloud.
b. Step 6034(2) of applying a direction finding process on each of the multiple 2D projections of the 3D point cloud to provide multiple head hair points 2D direction related estimations.
c. Step 6034(3) of determining the head hair points 3D directions based on the multiple head hair points 2D direction estimations.

Step 6034 may be followed by step 6036 of grouping the head hair points into multiple strand groups, wherein the grouping is based on distances between the head hair points and the head hair points 3D directions; wherein each strand group is associated with strand spatial information.

Step 6036 may be followed by step 6038 of generating, by a machine learning process, a representation of the head hair based on strand spatial information of the multiple strand groups and on strand shape information.

The machine learning process is trained to generate a representation of a head hair from any point of view.

The machine learning process may be trained by being fed with (a) images of head hair of persons that are acquired from different combinations of point of view and distance, and (b) models of the head hair of the person.

The representation is renderable from any point of view—as the machine learning process may be trained to generate representations that are renderable from any point of view.

Step 6038 may include receiving point of view and distance information and generating the representation as being viewed by a virtual camera having the point of view and located at the distance from the head hair.

The received point of view and the distance are arbitrary—thus the method is not confined to a few specific points of view. For example—the point of view and the distance are selected out of an arbitrary range of points of view and distances.

In step 6034 the multiple strand groups provide a volume-less representation of the strands; for example, each strand may be represented by a virtual zero-width line (for example a line of minimal or almost minimal width, such as a width of less than one point, one point and the like). The strand shape information provides volumetric strand information.

Step 6038 may be repeated multiple times following changes in a requested point of view, wherein the repetitions may occur in real time (for example with a response period of less than 1 second), even when the requested point of view differs from a point of view over which the machine learning process was trained.

FIG. 17D illustrates an example of a method 6050 for three dimensional modeling and rendering of head hair.

Method 6050 may include the following steps:

a. Step 6052 of obtaining a three dimensional (3D) point cloud of head hair points of a person.
b. Step 6054 of determining head hair points related 3D directions indicative of 3D directions of candidate strands that pass through the head hair points.
c. Step 6056 of grouping the head hair points into multiple strand groups, wherein the grouping is based on distances between the head hair points and the head hair points 3D directions. Each strand group is associated with strand spatial information.
d. Step 6058 of generating, by a machine learning process, head hair metadata regarding head hair based on strand spatial information of the multiple strand groups and on strand shape information. The head hair metadata is renderable from any point of view.

FIG. 17E illustrates an example of a method 6080 for three dimensional modeling and rendering of head hair.

Method 6080 may include the following steps:

a. Step 6082 of obtaining a machine learning process that is trained to generate a representation of a head hair from any point of view.
b. Step 6084 of receiving by the machine learning process, (a) strand spatial information of multiple strand groups that are formed from head hair points, and (b) strand shape information.
c. Step 6086 of generating, by the machine learning process, a representation of the head hair based on the strand spatial information of the multiple strand groups and on the strand shape information.

The representation of step 6086 may also be generated in response to a requested point of view.

Hand and Arm Movement

As indicated above, it is important to render avatars that represent participants of the 3D video conference to appear as natural as possible and have them appear in the virtual environment just like real people in real environments.

The system which creates the avatars may need to create full body avatars that replicate movements and gestures of the human participants despite the fact that, typically, participants are viewed by cameras that only see the top part of their bodies. It is important that the system would be able to create credible and natural appearing avatars, nevertheless.

It is assumed that while the viewing camera captures only the top part of the participant's body, it is still able to capture—at least partially—gestures that the participants perform with their hands. The arms, partially or fully, may be occluded behind objects—such as desks—or may be beyond the field of view of the camera.

People move their hands for many reasons. Sometimes the hand movements enhance what is being said. For example, research done on TED talks (https://www.scienceofpeople.com/hand-gestures/) shows that the speakers perform hundreds of gestures during an 18-minute TED talk. However, people who don't speak also move their hands. For example, this research (https://pubmed.ncbi.nlm.nih.gov/25637115/) found that people touch their face an average of 23 times an hour.

Typically, the camera captures only the hands and maybe some parts of the arms but not all of the arm. In order to create a natural looking avatar that performs the same gesture as the participant, it is important to be able to recreate full arm movement that is part of the gesture performed with the hand. This is due to the fact that the avatar within the virtual environment may be created with its arms fully viewable.

In order for the system to know how to render natural arm movements when the input is only the viewed hand gestures, a machine learning system, such as a neural network (NN), is trained to learn how arm movements correspond to hand gestures.

For training purposes, a person whose arms are fully exposed to the camera moves his or her hands. The NN is trained to learn how arms move for given hand gestures. The arm movements include both the postures of the arms and the speed of their movement as linked to the movements of the hands. The inputs to the NN are video clips (or sequences of images taken with very short periods of time between them) showing people performing many different gestures. The trained NN then knows how to place the arms whenever a partial gesture is captured by the camera, which allows rendering a natural-looking avatar performing the same gesture.

The training may take into account different clothing which the person wears. For example, wearing heavy clothes, such as jackets or coats, may induce different arm movements than those which happen when the person wears only light clothes.

Similarly, a different system may be trained to learn how a full body moves when only the top of the body is seen by the camera.

FIG. 18 illustrates an example of a method 6100 for avatar virtual arm movement.

Method 6100 may include step 6101 of obtaining (receiving or generating) information that maps hand gestures to arm movements. The information may be generated using training—for example any of the training mentioned above.

Method 6100 may include step 6102 of capturing, by a camera of a participant of a video conference, a hand gesture made by the participant without capturing an arm movement of the participant, the arm movement is associated with the hand gesture.

Step 6102 may be followed by step 6104 of determining the arm movement of the participant, based on the hand gesture.

Step 6104 may be followed by step 6106 of determining a virtual movement of an arm of an avatar of the participant within a virtual video conference environment, the virtual movement of the arm of the avatar of the participant mimics the arm movement of the participant.

The determining of the virtual movement of the arm of the avatar comprises determining a trajectory and speed of the movement.
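As a simple illustration of what a trajectory and speed may look like as data, the sketch below derives per-segment speeds from timestamped 3D positions of a point on the arm (for example, the wrist). The function name and the sample format are assumptions made for the example. The positions themselves would come from the determination of step 6106, which this sketch does not implement.

```python
import math

def trajectory_speeds(samples):
    """Return the speed of each trajectory segment.

    samples: list of (time_in_seconds, (x, y, z)) ordered by increasing time.
    """
    speeds = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("samples must be ordered by strictly increasing time")
        # Distance travelled in the segment divided by its duration.
        speeds.append(math.dist(p0, p1) / dt)
    return speeds
```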

Step 6106 may be followed by step 6108 of assisting to perform the virtual arm movement of the avatar within the virtual video conference environment.

The assisting may include rendering, sending instructions for rendering, sending parameters for rendering (for example when the avatar of a participant is a parametric model), displaying, and the like.

Any one (or any combination) of steps 6104, 6106 and 6108 may be executed by a machine learning process. The same machine learning process may execute steps 6104, 6106 and 6108, or different machine learning processes may execute steps 6104, 6106 and 6108. The same applies to any combination of two steps out of 6104, 6106 and 6108.

Step 6106 and/or step 6108 may include determining one or more arm movement restrictions, wherein the determining of the virtual movement of the arm of the avatar is based on the one or more arm movement restrictions.

The one or more arm movement restrictions may include an arm movement restriction imposed by a clothing of the participant.

The arm movement restriction imposed by the clothing of the participant may be a function of a weight of the clothing and/or an elasticity of the clothing and/or a gap between the arm and at least one part of clothing that at least partially surrounds the arm.

Virtual Non-Verbal Communication

A system which creates the avatars may need to create full body avatars that replicate movements and gestures of the human participants even when gestures are made towards items in the virtual environment. Moreover, it is important to be able to recreate concurrent multiple gestures from the real world of the participant to the virtual environment.

A participant of a virtual 3D video conference would typically be looking at a large screen or at a multitude of screens. Either way, the participant has the sensation of actually being in a real 3D environment in which many virtual items may exist. These may include avatars of other participants, boards, pictures, furniture or any other virtual object. The participant itself also has an avatar representing it and that avatar is seen by the other participants of the virtual video conference. When the participant looks at an avatar of a second participant, other viewers see the avatar representing the first participant appearing to be looking at the avatar representing the second participant (in this case, the second person will see the avatar representing the first participant appear to be looking at him from the screen). This has been described elsewhere.

In addition, the first participant may be performing additional gestures aimed at virtual items within the virtual environment. For example, the participant may be pointing with one of his hands at an avatar representing a third participant or at another virtual item within the virtual environment. In this case, the second gesture is also replicated within the virtual environment. This means that the avatar of the first participant would also appear to be performing the second gesture in addition to the first gesture mentioned above (looking at an avatar of a second participant).

In the general case, each participant may be performing concurrently multiple gestures aimed at one or more virtual items within the virtual video conferencing environment. In these cases, the avatar representing the first participant in the virtual environment would appear to also be performing the same gestures aimed at the same one or more virtual items within the virtual environment.

FIG. 19 illustrates an example of a method 6130 of performing virtual non-verbal communication cues within a virtual environment of a video conference.

Method 6130 may start by step 6132 of sensing that a certain participant of a video conference performs non-verbal communication cues in relation to virtual items that appear in a version of a virtual video conference environment that is displayed on a display of the certain participant.

Step 6132 may be followed by step 6134 of determining, by a machine learning process, virtual non-verbal communication cues to be executed by a certain avatar that represents the certain participant, wherein the virtual non-verbal communication cues mimic the non-verbal communication cues performed by the certain participant.

Step 6134 may be followed by step 6136 of assisting to perform the virtual non-verbal communication cues of the certain avatar within the virtual video conference environment.
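By way of non-limiting illustration only, steps 6132, 6134 and 6136 may be sketched in Python as follows. The data structures Cue and VirtualCue, the function names, and the dictionary format of the observations are hypothetical and are introduced only for this sketch. The machine learning process of step 6134 is represented by a callable; the stub_model used here is a simple pass-through stand-in and is not a machine learning process.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Cue:
    """A non-verbal communication cue performed by the participant (hypothetical)."""
    kind: str        # e.g. "gaze", "point", "nod"
    target_id: str   # virtual item to which the cue relates


@dataclass(frozen=True)
class VirtualCue:
    """A virtual cue to be executed by the avatar of the participant (hypothetical)."""
    avatar_id: str
    kind: str
    target_id: str


def sense_cues(observations: List[Dict[str, Optional[str]]]) -> List[Cue]:
    """Step 6132: keep only the observed cues that relate to a virtual item."""
    return [Cue(o["kind"], o["target"]) for o in observations if o.get("target")]


def determine_virtual_cues(
    avatar_id: str,
    cues: List[Cue],
    model: Callable[[str, Cue], VirtualCue],
) -> List[VirtualCue]:
    """Step 6134: let the supplied model map each sensed cue to a virtual cue."""
    return [model(avatar_id, cue) for cue in cues]


def perform(virtual_cues: List[VirtualCue], send: Callable[[VirtualCue], None]) -> None:
    """Step 6136: assist in performing the cues by handing them to a renderer."""
    for virtual_cue in virtual_cues:
        send(virtual_cue)


def stub_model(avatar_id: str, cue: Cue) -> VirtualCue:
    """Stand-in for the machine learning process: mimics the cue unchanged."""
    return VirtualCue(avatar_id, cue.kind, cue.target_id)


# Assumed observations: two cues aimed at virtual items and one unrelated movement.
observations = [
    {"kind": "gaze", "target": "avatar_2"},
    {"kind": "point", "target": "board"},
    {"kind": "scratch_nose", "target": None},
]

sent: List[VirtualCue] = []
sensed = sense_cues(observations)
perform(determine_virtual_cues("avatar_1", sensed, stub_model), sent.append)
print(sent)
```

Passing the model as a callable keeps the sketch independent of any particular machine learning framework; a trained model may be substituted for stub_model without changing the other steps.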

The virtual non-verbal communication cues may include at least one out of eye contact, one or more hand gestures, or one or more head movements.

The virtual items may include two or more avatars of two or more other participants or two or more virtual items that are not avatars of participants. A combination of at least one avatar and at least one non-avatar may be provided.

At least some of the non-verbal communication cues may be executed concurrently.

There may be a difference between the non-verbal communication cues and the virtual non-verbal communication cues.

At least one virtual non-verbal communication cue of the virtual non-verbal communication cues may be determined regardless of at least one non-verbal communication cue made by the certain participant.

FIG. 19A illustrates an example of an image of a participant 6141 making a “thumbs up” gesture and the virtual representation 6142 of the “thumbs up” gesture as made by an avatar (not shown) of the participant.

Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

The connections as discussed herein may be any type of connection suitable to transfer signals from or to the respective nodes, units, or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise, the connections may for example be direct connections or indirect connections. The connections may be illustrated or described in reference to being a single connection, a plurality of connections, unidirectional connections, or bidirectional connections. However, different embodiments may vary the implementation of the connections. For example, separate unidirectional connections may be used rather than bidirectional connections and vice versa. Also, a plurality of connections may be replaced with a single connection that transfers multiple signals serially or in a time multiplexed manner. Likewise, single connections carrying multiple signals may be separated out into various different connections carrying subsets of these signals. Therefore, many options exist for transferring signals.

Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermediate components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Furthermore, those skilled in the art will recognize that boundaries between the above-described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed among additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

Also, for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.

However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to embodiments of the disclosure containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

While certain features of the embodiments of the disclosure have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the embodiments of the disclosure. 

We claim
 1. A method for performing virtual non-verbal communication cues within a virtual environment of a video conference, the method comprises: sensing that a certain participant of a video conference performs non-verbal communication cues in relation to virtual items that appear in a version of a virtual video conference environment that is displayed on a display of the certain participant; determining, by a machine learning process, virtual non-verbal communication cues to be executed by a certain avatar that represents the certain participant, wherein the virtual non-verbal communication cues mimic the non-verbal communication cues performed by the certain participant; and assisting to perform the virtual non-verbal communication cues of the certain avatar within the virtual video conference environment.
 2. The method according to claim 1 wherein the virtual non-verbal communication cues comprise an eye contact.
 3. The method according to claim 1 wherein the virtual non-verbal communication cues comprise one or more hand gestures.
 4. The method according to claim 1 wherein the virtual non-verbal communication cues comprise one or more head movements.
 5. The method according to claim 1 wherein the virtual items comprise two or more avatars of two or more other participants.
 6. The method according to claim 1 wherein the virtual items comprise two or more virtual items that are not avatars of participants.
 7. The method according to claim 1 wherein at least some of the non-verbal communication cues are executed concurrently.
 8. The method according to claim 1 wherein there is a difference between the non-verbal communication cues and the virtual non-verbal communication cues.
 9. The method according to claim 6 wherein a determining of at least one virtual non-verbal communication cue of the non-verbal communication cues is made regardless of at least one non-verbal communication cue made by the certain participant.
 10. A non-transitory computer readable medium for performing virtual non-verbal communication cues within a virtual environment of a video conference, the non-transitory computer readable medium stores instructions for: sensing that a certain participant of a video conference performs non-verbal communication cues in relation to virtual items that appear in a version of a virtual video conference environment that is displayed on a display of the certain participant; determining, by a machine learning process, virtual non-verbal communication cues to be executed by a certain avatar that represents the certain participant, wherein the virtual non-verbal communication cues mimic the non-verbal communication cues performed by the certain participant; and assisting to perform the virtual non-verbal communication cues of the certain avatar within the virtual video conference environment.
 11. The non-transitory computer readable medium according to claim 10 wherein the virtual non-verbal communication cues comprise an eye contact.
 12. The non-transitory computer readable medium according to claim 10 wherein the virtual non-verbal communication cues comprise one or more hand gestures.
 13. The non-transitory computer readable medium according to claim 10 wherein the virtual non-verbal communication cues comprise one or more head movements.
 14. The non-transitory computer readable medium according to claim 10 wherein the virtual items comprise two or more avatars of two or more other participants.
 15. The non-transitory computer readable medium according to claim 10 wherein the virtual items comprise two or more virtual items that are not avatars of participants.
 16. The non-transitory computer readable medium according to claim 10 wherein at least some of the non-verbal communication cues are executed concurrently.
 17. The non-transitory computer readable medium according to claim 10 wherein there is a difference between the non-verbal communication cues and the virtual non-verbal communication cues.
 18. The non-transitory computer readable medium according to claim 17 wherein a determining of at least one virtual non-verbal communication cue of the non-verbal communication cues is made regardless of at least one non-verbal communication cue made by the certain participant. 